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Abstract 

We have accelerated an ab-initio QMC electronic structure calculation using General Purpose 
computing on Graphical Processing Units (GPGPU). The part of the code causing the bottleneck 
for extended systems is replaced by CUDA-GPGPU subroutine kernels which build up spline basis 
set expansions of electronic orbital functions at each Monte Carlo step. We have achieved a speedup 
of a factor of 30 for the bottleneck for a simulation of solid T1O2 with 1,536 electrons. To improve 
the performance with GPGPU we propose a new updating scheme for Monte Carlo sampling, quasi- 
simultaneous updating, which is intermediate between configuration-by-configuration updating and 
the widely-used particle-by-particle updating. The error in the energy due to by the single precision 
treatment and the new updating scheme is found to be within the required accuracy of ~ 10 -3 
hartree per primitive cell. 
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I. INTRODUCTION 



General Purpose computing on Graphical Processing Units (GPGPU) has attracted 
recent interest in HPC (High Performance Computing) for accelerating calculations at a 
reasonable cost. Environments for developing GPGPU, such as CUDA (Compute Unified 
Device Architecture) js, 4], have also contributed to the recent trend for using GPGPU for 
scientific applications with much increased portability ^|. Electronic structure calculations 
[|j] form one of the largest fields within HPC and there have been many attempts to accelerate 
such calculations using GPGPU {2]. Electronic structure calculation using quantum Monte 
Carlo (QMC) methods can provide highly reliable estimates of material properties for a wide 



range of compounds 



7 



10j. The very high computational demands are not so important 



because of the inherently high efficiency of massively parallel computational facilities for 



Monte Carlo computations 



111 ]. There have been several attempts to apply GPGPU to 



ab-initio QMC electronic structure calculations 



12 



13|- 



Previously we reported GPGPU acceleration of a QMC calculation for molecular systems, 
in which we achieved a speedup of more than a factor of 20 [ijj]. The key idea was to replace 
only the bottleneck subroutines in the main code by the CUDA kernel running on the GPU. 
We emphasized that the replacement of the entire simulation code by its GPU version is not 
practical from the viewpoint of version administration 12| . This becomes more serious for 
practical program packages with large number of users, as is common in ab-initio electronic 



structure simulations 



12|. 



It was challenging to achieve substantial acceleration using such a 'partial replacement 
strategy', and it should give a speedup of at least more than a factor of ten to be advantageous 
to use multi-core processor technology. In Ref.[l2] the main code written in Fortran90 (F90) 
was partially replaced by the GPU kernel, which were at the object code level. Users could 
switch back to the original CPU version of the subroutine if the GPU was not available. In 
the previous study GPGPU was applied to molecular simulations, although solid systems 
are the most attractive target for GPU-QMC electronic structure simulations [lo| because 
of the vast CPU time required and the potential of QMC to achieve more reliable results 
than frameworks such as density functional theory (DFT). 



The bottleneck in the present work differs from that in our previous molecular simulation 
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12[. In our previous work the bottleneck was the routine for computing the Hartree fields 
corresponding to the particle configuration [ijj]. In the present work the bottleneck is the 
routine for evaluating the single particle orbitals at the required particle positions. We have 
achieved a speedup of more than a factor of 30 with GPGPU compared with the single core 
performance of the conventional CPU evaluation. This acceleration does not, in principle, 
harm the MPI (massively parallel interface) parallelization efficiency, which is essentially 



the same as in our previous work 



12| . The conventional MPI parallel evaluation 10J can be 



accelerated further by attaching a GPU to each node. In QMC calculations the electronic 
orbitals are calculated many times at different electronic positions. It is quite common in 
ab-initio electronic structure methods, including QMC, that one builds up orbital functions 
for given electronic positions. Our implementation achieved here would be useful also in 
self-consistent field (SCF) methods used in density functional theories (DFT) or molecular 
orbitals (MO) methods. 

MPI parallelization has successfully been used in QMC electronic structure calculations 
0, 7 , [lO | , obtaining ~ 99% parallel efficiency by distributing the huge number of config- 
urations over the processing nodes. On each node the evaluation is usually sequential, 
though there have been several attempts to exploit further parallelization within the node 
using, for example, OpenMP |fj|. The evaluations performed on each node include up- 
dating an electronic configuration = (r[ , ■ ■ ■ \ • ■ ■ ), and sampling with the 
updated configuration, where a is the index for MC steps. There are two major types of 
updating scheme, the configuration-by-configuration scheme (simultaneous updating) and 
the particle- by-particle scheme (PbP, sequential updating). In the former, attempted trial 
iV-electron moves are generated to update a configuration, — > R^ a+1 \ and then the 
new configuration is accepted or rejected. In the latter, a trial move of a single electron 
is attempted and accepted or rejected, fj — > r^ +1 \ and the process is repeated N times. 
Sequential updating is more efficient than simultaneous updating, and it is widely used in 
QMC electronic structure calculations jfj]. For hybrid parallelization, including GPGPU and 
OpenMP, one seeks further parallelization in the sequential evaluation within an MPI node. 
The GPGPU performs the accept/reject steps for each particle 'simultaneously'. The ratio 
of the probabilities evaluated in the Metropolis accept/reject algorithm j3] differs both from 
those for simultaneous and sequential updating. In this sense our updating scheme can be 
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viewed as 'quasi-simultaneous updating' (Q.S.). This scheme is designed to obtain GPGPU 
acceleration by improving the sequential memory access (so called 'coalescing'), and the con- 
cealment of memory latencies. We have confirmed that our new updating scheme does not 
change the results within the required statistical accuracy, namely the chemical accuracy. 

The paper is organized as follows. In §11 we briefly summarize the VMC method (Varia- 
tional Monte Carlo method). The evaluation of the orbitals represented in a spline basis set 
is the bottleneck in the computation, as described in this section. The benchmark systems 
used in the performance evaluation are also introduced. §111 is devoted to a description 
of the GPU architecture. The structure of processors and memories in the GPU used in 
the present work is briefly explained. Other features, such as how we assign the number 
of threads and blocks for parallel processing, are discussed in §IV, in connection with the 
design and implementation of the quasi-updating scheme. The quasi-updating scheme is also 
introduced in this section. Several other possible implementations with different updating 
schemes or thread/block assignments are introduced here and their performances are com- 
pared. The results are summarized in §V, including comparisons of the energies, operation 
costs and data transfers, and the dependence of the performance on system size. In §VI 
we discuss the results, comparing with the ideal performance in terms of operations and 
memory access. We also discuss the possibility of using high-speed memory devices in the 
GPU and the relation to linear algebra packages. 



II. QMC ELECTRONIC STRUCTURE CALCULATION 



VMC 



In afr-zmtao_calculations the system is specified by a hermitian operator H called the 
Hamiltonian 



151 ] . The operator includes information about the positions and charges of the 
ions, the number of electrons, and the form of the potential functions in the system. The 
fundamental equation at the electronic level is the many-body Schrodinger equation, which 
takes the form of a partial differential equation with H acting on a multivariate function 
\& (r*i, • ■ • ,r*jv), known as the many-body wave function, where N denotes the number of 
electrons. The energy of the system, E, is the eigenvalue of the partial differential equation. 
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The equation has the variational functional [3] 

/ V*HV dfx - ■ ■ df N 



E 



J ^*^dfi ■ ■ -df N 
J \^\ 2 ■ fy" 1 ^ df x ■ • ■ df N 



(1) 



f l^l 2 df\ ■ ■ ■ dfjsr 

which is minimized when the above integral is evaluated with \l/ being an exact solution of 
the eigen equation. For a trial \& the functional can be evaluated as an average of the local 
energy, El (fx, • • • , fjv) = ^~ 1 H^ over the probability density distribution 



Pin, 



Tat, 



\I>| 2 / / Y&\ 2 dn ■■■dr. 



N 



(2) 



In VMC the energy is evaluated by Monte Carlo integration using the Metropolis algorithm 
to generate configurations ji?'- "' j distributed according to the probability distribution 



p(R), where R denotes a configuration (fx, ■ ■ ■ ,f N ), as 



1 M 



(3) 



with M being typically of the order of millions. The trial function \l/ is improved by an opti- 
mization procedure so that the integral of Eq. (0Q) can be minimized numerically js, 9]. Since 
each El (r^J is evaluated independently, the summation over a can be divided into sub- 
summations distributed over the processors by MPI with high efficiency [loj]. In this work 
GPGPU is used to accelerate the evaluation of each El ( R^ ) , rather than parallelization 



over the suffix a. We used the 'CASINO' program package 



6] for the VMC calculations. 



There are several possible forms of trial and we chose to use the common Slater- Jastrow 



type wave function |7|, nj 



*sj [R 

where e J (^ is known as the Jastrow factor 



e J y R ) ■ * D [ R 



(4) 



16 



13]. is a Slater determinant [18] 



*d (fi, 



■01 (fi; 



0i if N ) 



(fx] 



4>N (fW) 



(5) 



which is an anti-symmetrized product of one-particle orbital functions, {ipi {f)}f =l - The 
number of independent orbitals, L, can be reduced by using the symmetries of the system. 
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The bottleneck of the whole simulation has been found to be the construction of the {ipi (r)} 
(J. In this study the computational power of the GPU is devoted to the bottleneck process, 
as described in the following subsection. 



B. Orbital evaluation 

In each MPI process the following evaluations are performed sequentially: 

1. An attempted move, R( a ' — > w- a+1 \ is randomly generated, 

2. The updated probability p(w a+1 ^) and the ratio £ = p(R^ a+1 ^) / 'p(R^) is evaluated, 

3. Based on the ratio £, the attempted move R^ a+l ^> is accepted or rejected, 

4. The local energy El (^R^ a+1 ^ is evaluated. 

Each configuration is a set of electronic positions (we omit the spin coordinate for simplicity) , 
R( a ) = (j^ 1 } r^, Q ) ; . . . ,7^ ,rjy ). Following Eqs. ([2]), (jlj), and (EJ), one can reduce the 
evaluation of the ratio £ = p{R^ a+1 ^) / p{R^) to that of the orbital functions, W)\ y^^j j- 
In practical QMC calculations for extended systems the orbital functions are expanded in a 
5-spline basis set {6, (f)} Q, Q, 20| as 

4 3 =64 

^ = ais ' Qs ft) ■ ( 6 ) 

s=l 

The index s runs over the subset of the spatial sites within the unit cell of the periodic 
system. The spline basis functions, {0 S (r)}, have non-zero values only at sites s within the 
fourth nearest neighbor of the position r along each direction. The total number of terms 
in the summation ([6]) is therefore 64 = 4 3 (four spatial points along each direction in the 
three dimensional space), as a subset of the whole lattice within the unit cell amounting 
to S — 50 3 ~ 60 3 . The lattice is indexed by {§}f =1 , as depicted schematically in the two 
dimensional plane in Fig. [TJ The subset {s}^ =1 C {Sjfrf ~ 60 is the spatial region where 
{G s (f)} has non-zero values, depending on the given r. Since the indices introduced so far 
are complicated we summarize them in Table [H 

The value of Q s (f) at a s-lattice site, R s = (X s , Y s , Z s ), is given by the function depending 
on the distance between r and R s as, 

e.u = J^).Jy^).J^), (7, 



by 



6 



TABLE I: Conventions for indices used in this paper. 





Index 


Total Amount 


Reference 


Configurations 


a 


M ~ millions 


Eq. © 


Orbitals 


I 


L < N 


Eq. © 


Electrons 


3 


N 


§II.B 


Blip Grid (subset) 


s 


4 3 = 64 


Eq. © 


Blip Grid (whole) 


s 


S = 50 3 ~ 60 3 


Eq. © 



where ip (() is a second order polynomial in (, and (b x , b y , b z ) denote grid spacings for each 
direction. The coefficients in Eq. ([6]), {a^s}^ 250 ' 000 , are precomputed and provided as an 
input file, stored in memory at the beginning of the simulation. For each MC step with a 
updated particle position fj, the subset 

{(W^^t^lL (8) 

is identified and used in the summation ([6]). Denoting ai, s {r 3 ) — a [Z, j x , jy, j z ] as an array, 
s (^j) = {jxijyijz) forms a simply connected region in the three dimensional space but it does 
not allow sequential memory access in one dimensional address space, as it is interrupted 
with some stride due to the higher dimensions (see Fig. [p. The orbital index /, is, however, 
inherently one dimensional and we exploit this for sequential memory access, which is very 
important for GPGPU, as discussed in §IV.B. 




FIG. 1: Data structure for the expansion coefficients of the orbital functions in Eq. ©, schemat- 
ically depicted in two dimensional space. Black lattice points show the nearest site for each given 
fj. In the shaded regions the basis functions {Q s (rj)} in Eq. © have non-zero values. 
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We have identified the 'multiply and add' operations in Eq. fl6]), by which the orbital 
functions, {ipi (rj)}f =v are evaluated as the bottleneck in the present QMC simulation 



=(") 



— >• r 



This operation appears at every MC step when the particle position is updated, fj 
The number of operations for a single evaluation is proportional to the number of orbitals, 
L, and hence to N. In the present study we treat system sizes up to L = 384 and iV=l,536. 



C. Benchmark systems 



To investigate the dependence of the acceleration on system size, we prepared three 
different benchmark systems, as reported in Table [III For each system the atomic cores 
are replaced by pseudopotentials [s| in Si-diamond (iV=216) and cubic Ti0 2 (iV=648 and 
1,536). The periodic boundary conditions for (3 x 3 x 3) or (4 x 4 x 4) arrays of unit cells 
form a simulation cell. More detailed specifications for each system are given in Ref. 21] for 
Si and Ref. [22 1 for Ti0 2 . 



TABLE II: Benchmark systems used in the present study. N and L denote the number of electrons 
and orbitals for each system, respectively. The timing data and the acceleration factors achieved 
by the best coding are summarized, see §V.A. 



System 


N 


L 


CPU time (ms) 


GPU time (ms) 


Acceleration factor 


Si (3 x 3 x 3) 


216 


56 


2.77 


0.17 


16.58 


Ti0 2 (3x3x3) 


648 


168 


20.21 


0.82 


24.46 


Ti0 2 (4x4x4) 


1,536 


384 


100.00 


3.26 


30.67 



The bottleneck routine (orbital evaluation) to be replaced by the GPU processing kernel 
occupies 20~30% of the entire CPU time for Ti0 2 (iV=l,536), as analyzed by a profiler (Intel 



VTune Amplifier [23|). This depends on the choice of the 'dcorr' parameter in CASINO 

nn 

[6|, |24J, which specifies the interval between sampling; in order to reduce the correlation in 
the sampling, the local energy is evaluated every 'dcorr' MC steps (an MC step corresponds 
to the update of a configuration). The ratio of the CPU time spent in the bottleneck 
is reduced from 39.5% to 22.0% by increasing 'dcorr' from one to ten, as measured for a 
simulation with 10,000 MC steps. Typically 'dcorr = 5' is chosen, for which the reduction 
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becomes 27.5%. The reason for this dependence is that the orbital evaluation is called not 
only by the configuration updating but also by the local energy evaluation. Increasing 'dcorr' 
means less frequent evaluation of the local energy and hence less frequent calls to the orbital 
evaluation. 



III. GPU 



General descriptions of the architecture of a GPU can be found in the literature |2|-|4J, and 



our previous paper 



12] also provides such a description. This section provides the minimum 
amount of information required to understand the present work which was performed with 
the NVIDIA GeForce GTX 480 architecture. 

A GPU has hundreds of processing cores. The key points for the acceleration in the 
present work can be summarized as follows: (1) Parallelized tasks are distributed over many 
cores. The large number of processing cores of a GPU allows the whole task to be processed 
more rapidly than by a CPU. (2) The parallelized tasks are grouped into several sets (called 
'warps'). The GPU processes each warp in order ('barrel processing'), skipping those still 
waiting for data load from memory. As there are us ually hundreds of warps, barrel processing 
conceals the memory latency. In our previous work 12[ the acceleration was achieved mainly 
by dividing a huge number of loops into several subsets and distributing them over GPU 
processor cores. The present work does not follow this strategy, and we do not divide the 
loop for the summation in Eq. fl6]). Instead a huge number of independent parallel tasks, 
N x L = 1,536 x 384 = 589,824, for the orbitals <^ {tpi (f*-)}, =1 \ are distributed over 
the GPU processing cores. Other key points for the present work include, (3) memory 
latency is much improved when the access occurs with sequential memory address (memory 
coalescing), and (4) a command set called a 'Fused Multiply Add (FMA) which performs 
multiply and add operations within a clock cycle (two operations at once). 



A. Processors and performance 

GTX480 has 480 processor cores, each of which is termed a 'streaming core' (SP) for 
AMD products while 'cuda core' is used for NVIDIA products. In the present paper we use 
the term SP. The specs of the GTX480 are summarized in Table IIII1 As shown in Fig. [2] 
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every 32 SPs are grouped into a unit called a Streaming Multi-Processor (SM), the total 
number of which is hence fifteen. Each SP includes 32 bit scalar operators for floating point 
(FP32) and integer (Int32) data. These two operators can process the data independently 
within a clock cycle, giving a contribution to the ideal performance with 2 OP/cycle (two 
operations per clock cycle) for single precision operations. For double precision each 32 
bit operator deals with 64 bit floating point (integer) data with two clock cycles, termed 
FP64 (Int64), giving a 1 OP/cycle contribution on average for double precision operations. 
There is another kind of operation unit called a 'Special Function Unit' (SFU), devoted to 
evaluating hyper functions including exponential, logarithmic, and trigonometric functions. 
Each SM includes four SFUs in addition to the 32 SPs. A SFU performs four floating point 
operations per clock cycle, in parallel with other SM operations, which therefore contributes 
a further 4 OP/cycle to the ideal performance. With a clock frequency of 1.401 GHz, the 

TABLE III: Spec of the NVIDIA GTX480 GPU architecture used in the present work. 



Compute Capability 


2.0 


Clock of CUDA cores 


1401 MHz 


Global Memory 


1536 MB 


Memory Bandwidth 


177.4 GB/s 


Number of SM 


15 


Number of CUDA Cores 


480 


Constant Memory 


64 KB 


Shared / LI Memory 


16 or 48 KB per block 


Max number of Threads 


1024 per block 



peak performance of GTX480 is hence evaluated as 

1.401GHz x 15SM x (32SP x 20P + 4SFU x 40P) = 1, 681[GFlops] . 

Note that, unlike the previous GTX275, multiply-and-add operations are not subject to 
a SFU in the present GTX480. For evaluating Eq. (E]) there is hence no place for a SFU to 
be applied, giving an ideal performance to be compared with our achievement of 

1.401GHz x 15SM x (32SP x 20P) = 1, 345[GFlops] , 
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by omitting the contribution from SFU. Though the present work concentrates on single 
precision GPU evaluation, the ideal double precision performance is estimated to be 672 
[GFlops], which is half of that for single precision. GeForce GTX480, however, limits it to a 
quarter of this value, 168 [GFlops], by its driver, for some reason 12). These estimates are 
summarized in Table HVj compared with that of the CPU (Intel Core i7 920) used in the 
present work. 



Data From CPU 



GPU device 



^| 1.5GB Global Memory 



64KB Constant Memory (Read Only) 



768KB L2 Cache 



Streaming Multiprocessor #1 



64KB Shared Memory / Ll Cache 


Streaming 
Processor #1 




4KB Register 


Streaming 
Processor #2 


4KB Register 



Streaming Multiprocessor #15 



64KB Shared Memory /Ll Cache | 


Streaming 
Processor #1 




4KB Register 


Streaming 
Processor H2 


4KB Register 



FIG. 2: A schematic picture of the structure of the GPU device used in the present work. 



TABLE IV: Estimated ideal performances of GTX480 to be compared with our achievement. The 
peak performance of the Intel Core i7 CPU is also listed for reference. Note that the 'ideal' 
performance differs from the peak performance of the device (see the text). 





Clock freq. 


No. of 


Ideal Performance 




(GHz) 


cores 


(GFlops) 


GPU (GTX480) Single precision 


1.401 


480 


1,345 


GPU (GTX480) Double precision 


1.401 


240 


672 (limited 168) 


CPU (Intel Core i7 920) 


2.66 


4 


42.56 



Table IVl summarizes the specification of a computational node used for the experiments. 
On each node an Intel Core i7 920 processor 25[ and a GPU is mounted on a mother board. 
Hyper-Threading 26[ in the Core i7 processor is turned off, and it is used purely as a four- 
core CPU. Compute Capability specifies the version of hardware level controlled by CUDA, 
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above ver.1.3, which supports double precision operations. We used the Intel compiler 
version 12.0.0 for Fortran/C codes using options, '-03' (optimizations including those for 
loop structures and memory accesses), '-no-prec-div' and '-no-prec-sqrt' (acceleration of 
division and square root operations with slightly less precision), '-funroll-loops' (unrolling of 
loops), '-no-fp-port' (no rounding for float operations), '-ip' (interprocedural optimizations 
across files), and '-complex-limited-range' (acceleration for complex variables). For CUDA 
we used the nvcc compiler with options '-03' and '-arch=sm_13' (enabling double precision 
operations). 



TABLE V: Setup of a computational node. 



CPU 


Intel Core i7 920 2.66 GHz 


GPU 


NVIDIA GeForce GTX480 x 1 


Motherboard 


MSI X58M 


Memory 


DDR3-10600 2GB x 6 


OS 


Linux Fedora 13 


Fortran/C Compiler 


Intel Fortran/C Composer XE 12.0.0 


CUDA 


CUDA version 4.0 



B. Memory architecture 

It is essential in GPGPU coding to design efficiently the parallelized tasks (termed 
'threads') to be grouped into subsets with several different classes. With the variety of 
memory devices provided in a GPU, see Table IVll each subset has a different 'distance' from 
these devices. The performance of the GPGPU is critically affected by the choice of theses 
subsets because a good design can effectively reduce the memory latency. In the present 
study all of the threads are grouped into 'blocks'. Threads within a block can share memory 
devices with fewer latencies. 

Each block is assigned to a SM by which the threads within the block are processed. The 
SM processes 32 threads at once, as in vector processing. A bunch of 32 threads is called 
a 'warp'. When a warp accesses with sequential memory addresses, the latency is much 
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reduced (called 'coalescing'). To conceal the memory latency, the scheduler and dispatcher 
for warps monitor which warps are immediately executable (namely which ones have already 
completed their memory loads). Then the scheduled warps are processed sequentially by 
the SM. A schematic picture is shown in Fig. El 




[J=M lJ = Si [j = 6] [j = N] 




FIG. 3: A schematic picture showing the relation between blocks, threads, and streaming multi- 
processors (SM). 

Table [VI] contains only those kinds of memory relevant to this study, and excludes the 



texture memory 



Off-chip memories are located within the GPU board but not on the 
device chip. They have larger capacities and are accessible directly from CPU hosts but 
are in general slower. On-chip memories are complementary, namely with higher speed and 
lower capacity. Data required for GPU processing is transferred from the mother board to 
off-chip global memory, and is then loaded to on-chip shared memory, as usual. 

The capacity of the global memory ranges from 1 GB to 3 GB, depending on the product. 
As a trade-off against the large capacity it is about 100 times slower than on-chip memories. 
In GTX480, 768 KB of off-chip L2 cache is available to cover the low speed of the global 
memory. Another off-chip memory with high speed accessibility is the 64 KB constant 
memory. Via the constant caches located on every SM the constant memory can be accessed 
from all the threads with higher speed, although it is limited to read-only. As its name 
suggests, it is used to store constants referred to by threads. On-chip memories inside each 
SM include registers, shared memories, and LI caches. In GTX480 there are 32,768 registers 
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available for each SM. Registers are usually used to store the loop index variables defined 
within kernel codes, as in the present study. The 64 KB memory device on each SM can 
be shared by all the threads within a block with high speed access. The 64 KB capacity 
is divided into 48 KB and 16 KB parts which work as a shared memory and LI cache, 
respectively. The user can specify which 48 or 16 KB region corresponds to the shared 
memory or LI cache when the kernel code is compiled. The access to the global memory 
refers first to LI cache and then L2 and finally to the off-chip global memory device when 
it fails to load from cache, which are called cache misses. 

Data loading from the global memory takes at least 200 cycles, and more usually 400 ~ 
600 cycles. To conceal the latency, the GPU administrates all of the warps and monitors 
whether it is ready to be executed with the completion of data load. With sufficient warps 
one can ensure that the processors are almost always executing operations without waiting 
for data loads. To achieve better concealment it is essential to design the code so that it 
maintains a large number of warps. Since it depends on the specs of each architecture, such 
as the number of threads per warp, and the maximum possible number of threads per block 
etc., programming for better performance requires tuning for each GPU product. 





Location 


Cache 


R/W 


Availability 


Data maintained 


Register 


On-chip 




R/W 


within a thread 


during a thread 


Local memory 


Off-chip 


L1/L2 


R/W 


within a thread 


during a thread 


Shared memory 


On-chip 




R/W 


from all threads 
within a block 


during a block 


Global memory 


Off-chip 


L1/L2 


R/W 


from all hosts 
and threads 


during host code 
maintains 


Constant memory 


Off-chip 


Yes 


R 


from all hosts 
and threads 


during host code 
maintains 



TABLE VI: Various kinds of memory in a GPU relevant to this study. R and W stand for readable 
and writable, respectively. 
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IV. CODING 



Only the bottleneck routine for evaluating Eq. fl6]) is replaced by the CUDA kernel code 
executed on the GPU. The interface between the main code in F90 and the CUDA kernel 
is the same as that in our previous study 

A. Quasi-simultaneous updating 

To construct appropriate parallelized degrees of freedom, we introduced a new scheme 
for the MC updating of configurations. Let us denote a configuration at MC step a by 
i?( a ) = (f^ , ■ ■ ■ , rf , ■ ■ ■ , rjy^), and consider the update of a particle position fj — > rf +X ■ 
In configuration-by-configuration updating (simultaneous updating), the accept/reject of 
the updating is evaluated based on the ratio of the probabilities in Eq. (J2J), 

„/V a+1 ) ... d a +V d a +V d a +V ... J.°+i)\ 

P I ' 1 > ) ' j-X j'j >'j+l ' > N J 

^ sim = : u°) ... m . . . ' 

K I'l j > ' j ) j ' TV 

while in particle- by-particle (PbP, sequential updating), 

/ -ia+X) -^a+X) -(a+X) -{a) ^(q) n 

... P\'x > i' j-x i'j >'j+i> ''iv , 

Pyx ' > 'j-x ''j ''j+i' ''/v I 

The index j in £seq means that the accept/reject step is made for each particle move, unlike in 
configuration-by-configuration updating. These two updating schemes give slightly different 
values for the statistical estimates because of the different paths of the evaluations, but 
they coincide with each other within the statistical errors. We introduce another updating 
scheme based on the ratio 

_ P[r 1 , ,r j _ 1 ,r j ,r j+1 , ,r N j 



termed 'quasi-simultaneous updating' (Q.S.). In this scheme the accept/reject evaluation 
for the jth particle at step (a + 1) is based on the previously fixed a step configuration and 
on each particle position j, which gives N individual parallel tasks. 

Evaluating Eq. ( TTTT) reduces to computing the orbital functions with updated positions, 
l ~\ N 

| tyi J | | , which requires (N x L) independent evaluations. New trial moves 
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at the (a + 1) step 



f( Q b 



(rf +1) , 



(12) 



are generated on a CPU and sent to a GPU and the orbital functions are evaluated (see Fig. 
HP- For Ti0 2 (4x4x4) this gives N x L = 1536 x 384 = 589,824 parallelized tasks. Note 
that the parallelized multiplicity (N x L) scales as N 2 since the number of orbitals L oc N. 
The concealment of memory latency is more efficient when the multiplicity increases, and 
hence we expect better performance for larger systems. 



(CPU) 



Generate random walks 



Do j=l-N 

qf =D'/D 

accept/reject 

q J j =exp[V — j] 

accept/reject 

Enddo 



{ru(r>).v>)(^"C, 



(GPU) 




for determinants, 
for Jastrow factor. 



FIG. 4: Data transfer between a CPU and GPU in the present implementation. Trial moves for 
particle positions are generated and sent to a GPU. The GPU computes updated values of the 
orbitals to send back to the CPU. The accept /reject evaluation based on the orbital values are 
performed on the CPU. 



B. Assigning blocks 

Each thread evaluating ipi (jjj a+ j is labelled by (j, I). As mentioned in §11. B, the memory 
access for the coalescing should take / as the sequential index for the data load of a>i,s(fj) = 
a [l,jx,jy,jz], as required for Eq. (jBJ). We therefore assigned the threads sharing the same 
j with sequentially varying / = 1 ~ L within a block for the coalescing. The number 
of orbitals, L, cannot therefore exceed the maximum possible number of threads within a 
block, which is 1,024 for GTX480. The number L can usually be reduced to L < N using 
the symmetry of the system. It is then a natural choice to assign L within a block rather 
than N, which is always being larger than L. For non-magnetic solid systems as in the 
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present work, L is reduced to ~ N/4 by the symmetries, giving the limit N < 4,096 for 
GTX480. For magnetic systems, L ~ N/2, and hence N < 2,048, and for systems without 
time-reversal symmetry it becomes 1,024. This limitation is consistent with the maximum 
simulation size of contemporary QMC electronic structure calculations, which are able to 



treat at most a few thousand electrons in extended systems 211. Furthermore the limit of 

n 

1,024 in GTX480 is expected to double in future architectures [30J, making this issue less 
important. 

When evaluating Eq. ([6]), all the threads within a block refer to the same S (r}) with 
fixed j. Once a warp loads {Q s (r/)} 6 =1 , these data are stored in LI and L2 cache including 
their neighboring data. We can then expect effective cache hits for the data load. The 
operation of each thread, 64 terms multiply and add, easily fits the FMA of the GPU. 



C. Other code prepared for comparisons 

We prepared several other versions of the code with different updating schemes and 
thread/block assignments, in order to compare the performance. The original CPU im- 



plementation provided by the CASINO distribution jfj] is the particle-by-particle (PbP) 
algorithm, with double precision updating, which we refer to as [(0a) CPU/PbP]. Another 
version, [(1) GPU/PbP], is the GPU version of (0a) but with single precision updating, 
which is useful for studying deviations between single and double precision computations. 
In this version the GPU kernel is called with a single fixed j (PbP). Each term, ai s ■ Q(fj) 
in the summation of Eq. ([6]), labelled by s, is calculated by each thread, the sum of which 



is obtained by the reduction operation {4] among all threads. The threads indexed by s 
are grouped into those sharing the same / and hence the blocks are labelled by the orbital 
index I. Another reference is the version [(2) GPU/Q.S. /non-coalescing], which uses quasi- 
simultaneous (Q.S.) updating, but the same thread/block assignment as the version (1), 
namely each thread calculates only the product a/ s • ©(r}). In this case the blocks are la- 
belled by (j, I). Coalescing does not work in this version. The indices for threads and blocks 
are summarized in Table [VTT1 Comparing (1) and (2) shows firstly how the performance in 
speed is improved by the simultaneous data transfers for j = 1 ~ N in Q.S. compared to the 
sequential transfers in PbP. Secondly we can see how much the energy deviates due to using 
Q.S. updating instead of PbP. Our final implementation, described in §IV.B, is termed [(3) 
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GPU/Q.S. /coalescing]. The comparison between (1) and (2) is within the single precision 
treatment. To compare the Q.S. scheme in single and double precision we also prepared 
[(Ob) CPU/Q.S.], which is a double precision version of Q.S. 

V. RESULTS 

A. Acceleration performance 

Tables IVHI and IVIIII summarize the acceleration factors and computational time taken 
for the bottleneck kernel part within a MC step. The results are evaluated for systems with 
N = 216 (Si, 3x3x3) and 1,536 (Ti0 2 , 4x4x4). A better performance was obtained 
using implementation (3) rather than (2) and (1), and with larger system sizes N. The 

TABLE VII: Comparison of acceleration factors for each implementation evaluated from Table 
IVIIII 'PbP' and 'Q.S.' stand for the particle-by-particle and quasi-simultaneous updating schemes, 
respectively. The 'Index' column shows which indices in Eq. ([6]) are assigned to threads and blocks 
in the GPU. For system sizes N, see Table HI] for a more detailed description of the systems. 





Index 


acceleration factor 




block 


thread 


N = 216 


N = 1,536 


(1) GPU/PbP 


I 


s 


0.41 


1.47 


(2) GPU/Q.S. /non-coalescing 


1,3 


s 


6.39 


5.61 


(3) GPU/Q.S. /coalescing 


3 


I 


16.58 


30.67 



improvement from (1) to (2) is attributed to the increased number of threads in Q.S. due to 
processing the particle indices j = 1 ~ N simultaneously. This also brings about improved 
efficiency in the data transfer between the CPU and GPU, which is simultaneous transfer 
in version (2) and repeated transfers for each j in PbP version (1). The improvement from 
(2) to (3) arises because the number of operations performed on each thread is increased; 
multiply and add summation with 64 terms in (3) and just one multiplication in (2). The 
fact that memory coalescing only works in (3) also contributes to the improvement, for which 
a detailed analysis will be given in §VI.C. 
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TABLE VIII: Comparison of the computational times (ms) per Monte Carlo step in each imple- 
mentation. 'PbP' and 'Q.S.' stand for the particle-by-particle and quasi-simultaneous updating 
schemes, respectively. More information about the systems (of sizes N) are given in Table ILT1 





N = 


216 


N = 1,536 




CPU 


GPU 


CPU 


GPU 


(1) GPU/PbP 




6.76 




68.03 


(2) GPU/Q.S. /non-coalescing 


2.77 


0.43 


100.00 


17.83 


(3) GPU/Q.S. /coalescing 




0.17 




3.26 



B. Deviation in energy values 

In GPGPU for scientific simulations it is important to consider whether the deviation in 
the results due to the single precision operations are within the required accuracy. For the 
present electronic structure simulation the deviation should be within AE ~ 0.001 [hartree] 
in the energy estimation, known as the chemical accuracy. Table IIXI shows the results from 
each implementation, the ground state energy of Si, 3 x 3 x 3 (N = 216), by 1,000,000 MC 
steps with sampling every 10 MC steps ('dcorr' = 10, see §11. C). The comparison between 

TABLE IX: The ground state energy of Si, 3 x 3 x 3 (N = 216), from 1,000,000 MC steps, 
sampling at every 10 steps, evaluated by each implementation. 'PbP' and 'Q.S.' stand for the 
particle-by-particle and quasi-simultaneous updating schemes, respectively. 





Energy (hartree/primitiveCell) 


(0a) CPU/PbP (double prec.) 


-7.9590865 ± 0.0001749 


(1) GPU/PbP (single prec.) 


-7.9589381 ± 0.0001754 


(0b) CPU/Q.S. (double prec.) 


-7.9591560 ± 0.0001755 


(2,3) GPU/Q.S. (single prec.) 


-7.9592106 ± 0.0001698 



(0a) and (1) gives the deviation due to the change from double to single precision evaluation. 
The deviation is within the statistical error bars, but the agreement is poorer than single 
precision, which is expected to be correct to around six digits. To understand this we must 
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remember that the energies given in Table HXl are statistical estimates obtained from different 
accept/reject paths after 1,000,000 MC steps. When we look at the difference after a single 
MC step, agreement is confirmed within six digits, as shown in Table |Xl Even such small 
deviations may give rise to different decisions along a accept/reject branch. Once a different 
decision occurs, the subsequent random walk takes different paths, giving different estimates 
which are outside of the six digits but within the statistical error bars. 

TABLE X: Comparison between the energies after a MC step evaluated in each implementation. 
'PbP' stands for the particle- by-particle updating scheme. 





Energy (hartree) 


(0a) CPU/PbP (double prec.) 


-7.95075065699 


(1) GPU/PbP (single prec.) 


-7.95075065360 



A comparison of (0a) and (0b) in Table IIXI gives the deviation due purely to the two 
different updating schemes. As expected it is confirmed that these energies agree to within 
the statistical error bars. The result of (2,3) includes both the deviation due to the changes 
in the updating scheme and the numerical precision. Comparing with the reference (0a) 
demonstrates that our updating scheme keeps the result within the required accuracy. 

In the present study, the updated orbital values from the GPGPU/single precision evalu- 
ations are used only to determine if updated particle positions are accepted or rejected. The 
energies reported in Table ILXl were calculated using CPU/double precision. The difference 
between single and double precisions alters the paths of the random walks and hence where 
the R space is sampled. If the energies themselves were also evaluated using single precision 
it might introduce significant biases, but we have not investigated this here. As mentioned 
in §VI.d, there are further possibilities for accelerating the second largest bottleneck, which 
is the updating of the Slater determinants (or equivalents) using GPGPU. In this scheme 
the updated single precision value of the many-body wave function is used to evaluate the 
energies, and the errors in these energies should be considered carefully to make sure the 
results are still within chemical accuracy. 
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C. System size dependence of the performance 



The acceleration factors achieved by implementation (3) applied to iV = 216 (Si/2 x 2 x 2), 
648 (Ti0 2 /3 x 3 x 3), and 1,536 (Ti0 2 /4 x 4 x 4) are summarized in Table HH Figure [BJ also 
shows the acceleration factors and computational times (ms) taken for a MC step in (Oa), (1), 
and (3). As mentioned in §IV.A, the number of parallelized threads scales as (iVxi) ~ N 2 , 



120 



90 



E 60 



30 



(Oa)CPU I 

(1) GPU/PbP mmm 
(3) GPU/Q.S./coalescing mmm 
Acceleration factor ♦ 



216 



648 

Number of electrons 



1536 



30 
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FIG. 5: Computational time [ms] per MC step and corresponding acceleration factors for system 
sizes N (number of electrons). 'PbP' and 'Q.S.' stand for the particle-by-particle and quasi- 
simultaneous updating schemes, respectively. 



so that the efficiency of the GPU improves for larger systems. Coalescing in implementation 
(3) also becomes more effective for larger L. However there is a drawback for larger systems 
in the data transfer costs, especially for returning from the GPU to CPU. 

As overall cancellation, the acceleration seems to scale as ~ N , rather than ~ A^ 2 (the 
number of threads). Table IXll summarizes the data transfer times using ver. (3), which 
show that the transfer cost increases with system size. The percentage of the total GPU 
time is, however, decreased because the number of GPU operations also increases. 

Another remarkable fact illustrated by Fig.[5]is that the performance of ver. (1) is inferior 
up to N = 648 but superior for larger N = 1,536. The only possible reason for that is the 
concealment of memory latency, because in this implementation there is no coalescing and 
the operation load on each thread is simply a multiplication, although it has the largest 
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TABLE XI: Data transfer costs between the CPU and GPU from implementation (3) in Table IVlTl 
The dependence on the system size iV (number of electrons) is shown. 





CPU - 


■* GPU (/is) 


GPU -> CPU (jib) 


Total 


Percentage 


N 


s (3) 


V,) (*5) 


In 


(jjs) 


within total GPU time 


216 


16.8 


19.5 


29.1 


71.9 


47.6 % 


648 


18.8 


32.0 


290.9 


405.1 


45.8 % 


1,536 


25.5 


131.1 


1046.6 


1203.2 


36.9 % 



number of warps among all the implementations. 



VI. DISCUSSIONS 



To prevent redundancy we omit discussions of the evaluation of how much of the original 
CPU code is optimized, and of how we interpret the acceleration factors achieved in terms 
of the actual usefulness of GPGPU, as these were discussed in our previous report 121 ] . 



A. Acceleration performance 

The ideal limit of the acceleration factor to be compared with our achievement of 30.7 is 
that of (1345/10.64)=126.4, where 1345 [GFlops] is the ideal limit for the GPU discussed in 
§111. A and 10.64 is the single core performance of the Intel Core i7 processor used here for 
implementation (0a). This limit might be achieved when the ratio of the cost of memory 
loads to that for operations approaches zero, although this is not possible in practice. This 
ratio corresponds to those shown in the last column of Table IXII as in percentages. As 
shown in the table, the ratio is decreased for larger systems and hence we expect better 
performance. 

As another evaluation of our achievement, we estimated the FLOPS of our implementa- 
tions applied to Ti02 (A/=l,536) as listed in Table Kill The values are obtained from the 
number of operations required only for Eq. ([6]) [(2 operations) x (2 components in com- 
plex numbers) per term] divided by the time taken for the GPU kernel execution. We did 
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not take into account the operations required to identify which subset {s}^ =1 C {s}?"^ ~ 60 
should be chosen to form the coefficients {ai s }. The actual FLOPS should therefore be larger 
than those given in the table. In this evaluation, our achievement of a factor 30.7 gives an 

TABLE XII: Estimated FLOPS for each implementation and the ratios to the ideal performances. 
'PbP' and 'Q.S.' stand for the particle-by-particle and quasi-simultaneous updating schemes, re- 
spectively. 





Performance (GFlops) 


Ratio to Ideal performance 


(0a) CPU/PbP 


1.50 


14.10 % 


(I) GPU/PbP 


4.62 


0.34 % 


(2) GPU/Q.S./non-coalescing 


9.05 


0.71 % 


(3) GPU/Q.S./coalescing 


73.98 


5.50 % 



efficiency of only 5.5% of the ideal performance. For reference, the CUBLAS (GPGPU of 
BLAS Level 3) is known to give 400 GFlops on the NVIDIA Tesla C2050, which is 38.8% 
of the peak performance 27|j. The reason for the lower percentage in our case is the smaller 
number of operations per thread, 64 terms multiply and add summations for (3), and just 
a single multiplication for (1) and (2). The amount does not depend on the system size 
because of the use of a localized spline basis set, although this is a disadvantage for GPGPU 
in the sense that the number of operations is smaller. 



B. Performances in memory access 



Table IXIIII summarizes the memory load per 



sured by 'Compute Visual Profiler' for CUD A 
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ormances of each implementation as mea- 
. The increase in the number of 32 bit load 



from (1) to (2) is simply because of the increased number of threads due to the simultaneous 
processing with respect to the number of particles N in Q.S. From the coalescing in (3) we 
see a remarkable reduction in the number of memory loads by a factor of ~ 27. Correspond- 
ingly the throughput becomes much closer to the peak value, 177.4 GB/s, compared with 
the poor achievements in (1) and (2) due to the random access of {cij S }. 

The SM activity in Table IXIIII is a measure of the efficiency of the concealment of the 
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TABLE XIII: Performance in global memory access for each implementation. 'PbP' and 'Q.S.' 
stand for the particle-by-particle and quasi-simultaneous updating schemes, respectively. 





Global Mer 
Num. of 32bit Load 


nory Access 
Throughput (GB/s) 


SM activity 


(1) GPU/PbP 


24,576 


13.5 


88.1 % 


(2) GPU/Q.S. /non-coalescing 


188,744,000 


18.1 


99.4 % 


(3) GPU/Q.S. /coalescing 


7,077,890 


153.0 


100.0 % 



memory latency. This quantity is denned as the ratio of the number of cycles at which 
the operations started after the completion of memory loads to the total number of cycles 
taken for the kernel execution. With efficient concealing, at least one of the threads is 
always ready for execution at each cycle, and hence this quantity is expected to be close 
to 100%. Since the concealment becomes more effective with larger numbers of warps, the 
SM activity is increased from (1) to (2) by the increased number of threads. Though the 
number is decreased again in (3) by a factor of ~ 64, greatly accelerated memory accesses 
by the coalescing improve the SM activity to 100%. 



C. Shared memory and read only memory 



In our previous work (12J we found that it was quite effective to exploit high-speed read- 
only memory. In the present work we have investigated further improvements by exploiting 
high-speed read-only memory but found no significant gains. This is summarized as follows: 
(i) in the present case the data required for the operation on each thread is large and does 
not fit into the high-speed memory devices, and (ii) even without explicit use of high-speed 
devices, the compiler automatically assigns them to LI cache, and the explicit data transfer 
to shared memory by the user gives rather slower performance than automatic assignment. 

The data loads considered in the present case are {a^}?^ 250 ' 000 , which is initially stored 
in the global memory. After choosing the subset {{a;.s}^i 1 } i _ 1 from {a^g}, our best imple- 
mentation (3) loads them by coalescing access to the global memory. However, the access 



speed to the on-chip shared memory is around 100 times faster than that. 



3l|. A block 
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sharing a shared memory device has a common j so the set is shared by all 

the threads within the block. The 64 KB capacity of the device corresponds to 16,000 (= 
64 KB/4B) elements of single precision data. It is therefore possible to accommodate {a^ s } 
when L < 16, 000/64 = 250, so that the total number of orbitals is less than 250. Table HT1 
shows that Si (3 x 3 x 3) and TiC>2 (3x3x3) correspond to this case. Storing the data 
within the on-chip memories becomes advantageous when the data is referenced repeatedly 
by the warps. The number of repeated references is given in our case by the ratio of L ~ 250 
to the warp size, 32, which is less than 10. Beyond this number of repeats the SM switches 
to another process with different j and hence the corresponding new data for {a/ jS } should 
be loaded again from the global memory. We have tried such an implementation using 
shared memory devices but we did not find any improvements over (3), possibly because of 
the small number of repeated references. Such an improvement would already be included 
implicitly in (3) by LI cache acting on the shared device. The explicit usage of the device 
as the shared memory seems to reduce the performance. 

Another choice is to use constant memory. Unlike shared memory it is located off chip 
and blocks with different j indices pick up their subset {a>i )S }. The memory should there- 
fore be able to provide the whole set of {a^}, which is far beyond the size of the con- 
stant memory and is therefore infeasible. Another data required for the GPU operation 
is |{6 S (rj)}j =1 j in Eq. ([6]), the total size of which can be accommodated within the 
constant memory for N < 250. Our trial implementation using constant memory for {O s } 
actually gives a slight improvement in performance, by ~ 4%, but this is only applicable to 
smaller system sizes. The data jo s (^f^)} is updated at every MC step and hence the 
life time in cache is short, giving only a slight improvement in performance. 



D. Acceleration of Slater Determinant Updating 

The bottleneck operation of updating of orbital functions, which are computed by the 
GPU, gives a system size dependence of 0(N 2 ) 6j. The corresponding updating of the many- 
body wave function, (jSJ), actually takes 0(N 2 ) in the PbP implementation [?], Q], in spite of 
the fact that evaluating a determinant scales as 0(N 3 ). In PbP, say r} — > r^, only one column 
of the determinant including r* is updated at each step. The updating of the determinant 
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can hence be evaluated based on the Laplace expansion with respect to the column, and the 



other co-factors are unchanged. This algorithm, known as Sherman-Morrison updating 129 1 

ri 

therefore requires only 0(N 2 ) operations [6J. In our Q.S., the updated orbitals are stored in 
an array and are read sequentially to update the determinant using this algorithm. The cost 
of the updating of the many-body wave function amounts to 19.4% of the time compared 
to 39.5 % for the present bottleneck, the orbital updating, when JV = 1,536, dcorr=l. The 
ratio of the former to the latter increases with larger dcorr. This operation mostly involves 
BLAS routines (ddot and daxpy) which could be replaced by CUBLAS in future work. 



E. As a prototype of linear algebra problems 



The evaluation of Eq. fl6]) can be written in terms of linear algebra, 



( ^ ft) \ 

^2 ft) 



A ft) • 



/ 8i ft) ^ 

©2 ft) 



^ ©64 ft) J 



(13) 



ft) J 

for which the matrix A is defined by 

A = {a is } = {ai jS (f.)} = A ft) . 
If the matrix A is constant, we can write Eq. f JT3|) as a matrix multiplication: 

( 0! ft) e 1 ft) • • • 0! ftv) \ 

A ■ 



(14) 



( -01 fti) -01 (r 2 ) ■ ■ ■ i/ji ftv) 
2 fti) '•• : 



02 fa} 



(15) 



^i(^l) 1pL(r N ) J \^ 064 fti) ••• ©64 ftv) J 

and using this we could increase the ratio of operations to data transfer, and hence the 
efficiency of GPGPU by a large amount. Fig. [TJ however, reminds us that the matrix A 
in Eq. ()14p is randomly varying in its elements choice indexed by fj, but within the given 
constant elements and hence we cannot reduce Eq. ([TBI into a unified form as Eq. 

(in. 



Since our calculation of Eq. ([6]) is linear algebra we considered whether it could be per- 
formed efficiently using CUBLAS. To use CUBLAS we have to construct A (fj) on the CPU 
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at every MC step, transfer it to the GPU, and then call CUBLAS. In our case the randomly 
varying matrix A is not large, however, the cost of constructing it cannot be compensated 
by the high performance of CUBLAS. Since the enlarged matrix A = {ais}~ =1 is a constant 
matrix, only one data transfer to the GPU is required in principle, but this is not possible 
with CUBLAS which requires the operands to be transferred every time for the operations. 
In this sense, our achievement here corresponds to an effective GPGPU implementation of 
the algorithm for the prototype as explained above, Eq. ffTB"]) with a randomly chosen matrix. 

VII. CONCLUDING REMARKS 

We have applied GPGPU to the evaluation of the orbital functions in ab-initio Quan- 
tum Monte Carlo electronic structure calculations, which we identified as the computational 
bottleneck. For efficiency we propose a new updating scheme for generating trial moves for 
the walkers in the Monte Carlo sampling, which we call quasi- simultaneous updating. Using 
this scheme we achieved a speedup of more than a factor of 30 compared with using a sin- 
gle core CPU. The GPGPU implementation gives a deviation in energy from original CPU 
evaluation which is smaller that the required chemical accuracy. Though the effective per- 
formance in operations amounts to 74 GFlops, which is only 5.5% of the peak performance, 
the memory throughput reaches 153 GB/s, which is 86% of the peak value with almost 
perfect concealment of memory latency as shown by the SM activity. The implementation 
presented here is a prototypical problem of linear algebra with a sort of random matrix, 
processed by GPGPU. 
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